Visualizing high dimensional descriptors of molecular structures

ABSTRACT

The distribution of chemical compounds in high-dimensional molecular descriptor space can be viewed in two dimensions by applying the projection method of this invention. This method is particularly useful for viewing the relationships among a large number of compounds, such as those found in a large-scale HTS or virtual combinatorial library. After a representative subset of the larger data set of compounds is selected, initial components from the high-dimensional descriptor space are determined by PCA. In order to relax an NLM projection using the PCA components as a start, the stress function is modified to reflect a local horizon beyond which the separation of the compounds is not meaningfully measurable. The resulting two-dimensional projections provide clear insight into the distribution of the chemical compounds in the higher-dimensional space. The method is clearly generalizable to viewing descriptor space in three dimensions and to using high-dimensional descriptors other than those used to describe molecular structure.

[0001] A computer program listing appendix is part of the disclosure and is incorporated herein by reference. The computer program listing appendix contained on compact disks contains the following files: Identification Information (1 KB) and NLMJER.C (20 KB). The disks were created on Oct. 27, 2003.

BACKGROUND OF THE INVENTION

[0002] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

[0003] 1. Field of the Invention

[0004] This invention relates to the field of computational molecular structural analysis of large data sets of molecular structures and more specifically to graphical displays that present an accurate qualitative representation of the distribution of molecular structures in the high dimensional space of molecular descriptors.

[0005] 2. Background of the Art

[0006] With the advent of high throughput screening (HTS), combinatorial synthesis, and analysis and selection of compounds from computer generated virtual libraries, research scientists, and pharmaceutical scientists in particular, are faced with an expanding problem of separating compounds of most significance to their work from a clutter of possibilities. In recent years an appreciation has developed that: 1) it is useful to think about how molecular structures populate a “diversity space” of all possible structures; 2) structures generated from different synthetic routes may populate the same or different volumes of diversity space; and 3) broad based screening programs should utilize compounds from across diversity space and avoid overscreening with compounds that densely occupy the same volume of diversity space.

[0007] Scientists in drug discovery research make decisions each day that affect the course of their projects. A decade ago, decisions were based on infrequent new biological data, and resulted in making small numbers of compounds per year. Today, high throughput screening laboratories generate a constant stream of new biological data and call for larger numbers of new compounds to be made ever faster by combinatorial chemistry laboratories.

[0008] Decisions about which compounds to acquire or synthesize to test next are based in part on the output of computations utilizing advanced molecular structural descriptors. The simplest drug discovery principle is that compounds similar in enough properties are usually similar in biological activity. Similarity often involves measures in high-dimensional spaces, such as molecular fingerprints or shape descriptors which typically utilize around one thousand dimensions. Uses of similarity in drug discovery research may apply these high-dimensional descriptors to millions of compounds from virtual libraries of potentially synthesizable compounds or to libraries of synthesized compounds which have been generated.

SUMMARY OF THE INVENTION

[0009] The method of this invention enables scientists to examine relationships among the vast numbers of compounds in high-dimensional diversity space in a familiar two-dimensional visual map context. The method for visualization of high-dimensional diversity spaces relies on the implementation of horizons, which are distances beyond which the distance matrix between compounds need not be resolved, and on efficient subsampling methods. The method also enables the selection of optimal descriptors to cluster compounds for predictive use when combined in genetic algorithms. Optimal descriptors help not only in visualizing important features of diversity space, but in deciding which compounds to make and test next during early analoging of active substances.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 shows a schematic outline of the process of the invention.

[0011] FIG. 2 shows a typical two dimensional projection using the method of the invention.

[0012] FIG. 3 shows the virtual reaction which defines the sulfonylpiperidine urea combinatorial library.

[0013] FIG. 4 is a schematic illustrating the application of OptiSim methodology to combinatorial sub-library design for a two-component reaction defined by A+B→AB. Upper-case letters correspond to selected reagents; lower-case letters denote candidate reagents in subsamples considered at each step, with cells shaded to indicate the order in which products are added to the design. Block dimensions are set at 3×4 and k is set to 3 for illustrative purposes.

[0014] FIG. 5 shows projections of fingerprints for a 300 compound OptiSim subset (k=3) of the sulfonylpiperidine urea library into two dimensions. Paired symbols indicate more closely related compounds, whereas circles correspond to relatively isolated ones. Structures for compounds represented by highlighted points are given in FIG. 6. (A) Map based on scores from the first two components of a principal components analysis (PCA) using Euclidean distances between fingerprints. (B) Non-linear map obtained from the coordinates in (A) using Soergel distances and the stress function given in Equation 2.

[0015] FIG. 6 shows structures for the particular sulfonylpiperidine ureas highlighted in FIGS. 4 and 5. Numbers in parentheses indicate the OptiSim selection index for each product. X denotes the piperidyl core.

[0016] FIG. 7 shows non-linear maps for the 300 compound OptiSim subset. Initial coordinates obtained from PCA were relaxed by minimizing the modified stress function given in Equation 3. Highlighted points refer to the structures shown in FIG. 3. (A) h=0.65. (B) h=0.5. (C) h=0.4. (D) h=0.3.

[0017] FIG. 8 shows a non-linear map for the 300 compound OptiSim subset obtained with h=0.3. Highlighted products were selected to illustrate the relative distribution of structural classes across the map.

[0018] FIGS. 9A and 9B show a non-linear map for combinatorial sulfonylpiperidine urea sub-libraries. Each sub-library comprised 200 products, of which 100 were chosen at random and projected together using h=0.3. “Cherry picking” indicates OptiSim selection, whereas single- and four-block designs were created using an extension of OptiSim described in the text. A subsample size k=5 was used in generating each of the three designs.

[0019] FIG. 10 shows non-linear maps showing projections of biological activity and pharmacophoric structure into fingerprint space for a proprietary library of potential kinase inhibitors with respect to a specific kinase target. Large symbols indicate actives, whereas small symbols denote generic inhibitors which failed to inhibit the target enzyme. Specific actives are highlighted as circles and squares. (A) PCA map for 100 actives selected at random together with 300 randomly selected inactives. (B) “Classical” NLM (h=1.0) obtained starting from the PCA coordinates in A. (C) Modified NLM obtained using a horizon h=0.3. (D) Map for actives and inactives “hit” in a UNITY 3D flex search run against a query built from a particular pharmacophore model of the target enzyme's active site.

DESCRIPTION OF THE INVENTION

[0020] Computational Chemistry Environment

[0021] Generally, all calculations and analyses to generate the visualizations of this invention are implemented in a modern computational chemistry environment using software designed to handle molecular structures and associated properties and operations. For purposes of this patent document, such an environment is specifically referenced. In particular, the computational environment and capabilities of the SYBYL and UNITY software programs developed and marketed by Tripos, Inc. (St. Louis, Mo.) are specifically utilized. Unless otherwise noted, all software references and commands in the following text are references to functionalities contained in the SYBYL and UNITY software programs. Where a required functionality is not available in SYBYL or UNITY, the software code to implement that functionality is provided in an Appendix to this Application. Software with functionalities similar to those of SYBYL and UNITY is available from other sources, both commercial and non-commercial, well known to those in the art. A Java enabled computing environment for graphical interface is also referenced. A general purpose programmable digital computer with ample amounts of memory and hard disk storage is required for the implementation of this invention. In performing the methods of this invention, representations of thousands of molecules and molecular structures as well as other data may need to be stored simultaneously in the random access memory of the computer or in rapidly available permanent storage. The inventors use a 150 MHz R4400 SGI computer with an R4010 floating point processor, 128 Mbytes of memory, disk space locally and on a network with no specific quota, and access to graphics from other SGI consoles as well as via X windows on PCs and X terminals.

[0022] Definitions:

[0023] Explicit library: a collection of compounds in which each compound has an explicit structure. Corporate compound library databases at pharmaceutical companies fall in this category.

[0024] Fingerprints: a vector of binary variables that represents the presence or absence of 2D molecular fragments in a molecule. In this patent document fingerprints refer specifically to the 988 binary variables used for the past several years in the UNITY structural database definition, in which all possible fragments of length 2 to 6 are hashed together and key heteroatoms (O, N, S, P, Si, halogens) and rings are counted.

[0025] Horizon: a distance beyond which all points are indistinguishable.

[0026] NLM: non-linear mapping. This algorithm attempts to minimize the overall fractional error in preserving the actual distances in many dimensions when going to fewer dimensions.

[0027] Modifying this algorithm is a key part of the present invention.

[0028] PCA: principal component analysis. This mathematical method is used to select an initial guess for the coordinates of compounds in the visualization.

[0029] Singleton: a point with no neighboring points nearby. In the context of a distance horizon, any compound that has no other compound closer to it than the horizon is a singleton.

[0030] Tanimoto: similarity measure between two fingerprints, ranging from 0 (no similarity) to 1 (perfect similarity). It is computed as: (#bits in common)/(#bits in either). A Tanimoto derived distance is computed as 1−Tanimoto.

[0031] Virtual library: a collection of compounds that exists only in computer representations.

[0032] In this patent document virtual libraries more specifically refer to collections of all products that can be made by combining all suitable reagents in specific synthetic reactions, or to subsets of such products which meet additional criteria such as an upper bound on molecular weight.
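The horizon and singleton definitions above can be made concrete with a short sketch. It assumes the distances are supplied as a symmetric matrix (a list of lists); the function name `find_singletons` and the toy distance values are illustrative only and are not taken from the appendix code.

```python
def find_singletons(dist, horizon):
    """Return indices of points that have no other point closer than the horizon."""
    n = len(dist)
    singletons = []
    for i in range(n):
        # A point is a singleton if every other point lies at or beyond the horizon.
        if all(dist[i][j] >= horizon for j in range(n) if j != i):
            singletons.append(i)
    return singletons


# Toy Tanimoto-derived distances among four compounds (hypothetical values).
toy = [
    [0.00, 0.10, 0.80, 0.90],
    [0.10, 0.00, 0.75, 0.85],
    [0.80, 0.75, 0.00, 0.60],
    [0.90, 0.85, 0.60, 0.00],
]
print(find_singletons(toy, horizon=0.30))  # compounds 2 and 3 have no near neighbor
```

With a horizon of 0.30, compounds 0 and 1 are neighbors of each other, while compounds 2 and 3 are singletons.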

[0033] Description:

[0034] The problems of generating a two-dimensional display of high-dimensional diversity space involve the same type of considerations and limitations encountered with familiar geographic mapping. Accurately depicting points from a 1000 dimension space in two dimensions is impossible, as is preservation of distance/angle/area information when mapping the earth's curved surface onto the two dimensional plane of a piece of paper. For instance, a Mercator projection accurately maintains position and angular information but loses accurate area representation, making high northern or southern land masses disproportionately large compared to mid-latitude areas. A homolosine projection, on the other hand, preserves area relationships accurately, but loses other information.

[0035] The important point is that any two dimensional map should preserve the features and relationships critical to its particular use. In the present invention, the two dimensional maps preserve useful information about the distance relationships of compounds in diversity space. In particular, care is taken to preserve neighbor relationships by means of the horizon approach. A horizon is a distance beyond which all points are indistinguishable. Just as an unaided eye cannot see objects obscured by the earth's curvature, the neighborhood principle asserts that when compounds are dissimilar enough, there is no information in quantifying that dissimilarity. Further, when molecular descriptors are employed which possess a neighborhood distance (that is, which validly relate descriptor space to biological properties), it is possible to relate biological activity distributions across the two dimensional plot.

[0036] The visualization method of this invention is based on two key ideas. First, large numbers of compounds can be represented by plotting only a subset of compounds that represent compact clusters. Second, the important information is contained in short range distances between near neighbors. The preferred manner of practicing the method of this invention combines the sampling ability of the OptiSim methodology, standard PCA techniques of component projection, and a modified method of applying NLM with a modified stress function which uses the horizon to relax the mapping constraints. The methodology of the present invention is implemented in a computational environment where many programs may be used to display the scatter plots output by the projection, and Java or other display environments may be used to display the results in an interactive manner. FIG. 1 shows the overall process:

[0037] Step A: Select the set of compound structures to be visualized. This may be one or more virtual libraries as well as one or more explicit libraries.

[0038] Step B: Compute a vector of molecular descriptors for each compound.

[0039] Step C: Generate a distance matrix between all compounds or utilize a function to generate the distance matrix elements as needed.

[0040] Step D: Compute a hierarchy of clusters, defining cluster centers and partitioning each set of compounds at each level. For small datasets this is not needed (equivalent to having each compound be alone in its cluster). For virtual libraries, which may contain millions of compounds, selection of representative subsets is both computationally necessary and a prerequisite for legibility of displays.

[0041] Step E: Perform a PCA projection onto the first two components. This provides an initial placement of compounds onto (x,y) coordinates. In the case of fingerprints, it also serves to spread out compounds in a useful way.

[0042] Step F: Run the NLM refinement of initial coordinates. The usual objective function in this algorithm has been modified for the current purposes to include a horizon limitation.

[0043] Step G: Create a graphical display from the coordinates of each compound. Do so such that the chemist can easily see which compounds are singletons and can tell which set of compounds each point came from.

[0044] Additional Display—Step H:

[0045] If desired, features of the display environment could provide access to information useful to explore the points in the two dimensional plot. A display implemented in Java could service graphical inquiries such as:

[0046] 1. How many compounds are represented by a specified cluster center compound?

[0047] 2. How far apart are two compounds?

[0048] 3. Where is this named compound in the graph?

[0049] 4. What is the structure of this compound?

[0050] 5. What is the nearest point to this compound in the “real” high dimensional space?

[0051] Possible additional Step I:

[0052] Subset Reprojection: Iterate to visualize subsets of the current graph, the purpose being:

[0053] 1. To obtain more accurate depictions of a portion of the displayed diversity space.

[0054] 2. To drill down into more detail by expanding selected cluster centers into all compounds that fall into the cluster partition.
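Step E above can be sketched in a few lines. The following is a minimal illustration, assuming the descriptors are given as equal-length lists of numbers. It finds the first two principal components by power iteration with deflation, which is only one of several ways to compute a PCA and is not taken from the appendix code. The function name `pca_2d` and the toy fingerprints are hypothetical.

```python
import math
import random


def pca_2d(rows, iters=200, seed=0):
    """Project rows (lists of numbers) onto their first two principal components."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        for _ in range(iters):
            # Deflation: remove any part of v along components already found.
            for c in components:
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            # Multiply by X^T X without forming the covariance matrix explicitly.
            scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
            v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        components.append(v)
    # Scores on the two components serve as the initial (x, y) coordinates.
    return [[sum(a * b for a, b in zip(row, c)) for c in components] for row in centered]


# Toy bit vectors: two pairs of related "compounds" (hypothetical data).
toy_fps = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
coords = pca_2d(toy_fps)
for point in coords:
    print([round(x, 3) for x in point])
```

In this toy case the first component separates the two pairs of related bit vectors, which is the kind of initial spreading that the NLM refinement of Step F then starts from.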

[0055] The process results in a display such as shown in FIG. 2. In this case the intent is to compare compounds which come from three distinct chemical series (chalcones, styryls, and phenylquinolones). The three series are divided into three clouds in the two dimensional projection. In this projection, the series are well separated. For this Figure, the chemist selected the compound q35 and requested that the nearest compound in each group be highlighted in the graph; the points mol49, 67dimethoxystq, and q38 are displayed as 2D structures in the right panels, and the distance in the true fingerprint space from q35 to each is printed in the one line text window immediately below the graph.

[0056] As noted above, accurately depicting points from a 1000-D space in 2-D is impossible. We can achieve a useful level of success, however, by two related observations: we mostly care about preserving neighbor relationships, and we especially look for “overlap” of one set of compounds with another. The neighborhood issue has resulted in novel relaxation of mathematical constraints, while the overlap interest has led to novel biased selection methods from very large virtual library collections.

[0057] While it is believed that PCA/NLM has never been used with fingerprints before, FIG. 2 also illustrates a critical difference between the visualization method of this invention and a “traditional” PCA/NLM type projection. The visualization method of this invention assumes that when two compounds are beyond each other's horizon (when they are far enough apart), the exact distance between them is unimportant and need not be preserved. Specifically, it is most useful to run with a horizon of 0.30 in Tanimoto distance. Long range distances are ignored. This is evident in the graph, where clusters appear to be separated by more than 1.0 units even though the largest possible Tanimoto distance is 1.0. So long as the compounds actually differ by 0.30 or more, there is no penalty for displaying them arbitrarily far apart.

[0058] Previous work by Patterson et al. revealed that when two compounds are more than 0.85 similar by the Tanimoto metric of fingerprint similarity (or at a distance of less than 0.15=1.0−0.85 in this graph) then they are likely to also show similar biological effects. At twice this distance, there is little or no predictive information about the activity of one compound to be obtained from knowing the biological activity of its partner.

[0059] In the original formulation of NLM (Sammon, 1969), the objective function to be minimized is the sum of squared fractional deviations between the distance matrix in the original high dimensional space and the distance matrix in the projected space: (True − Projected)/True. A small value is used in the denominator to avoid division by zero when necessary. In the modifications that have proven to work in the method of this invention, distances within the horizon are preserved:

[0060] Both “true” and “projected” distances are replaced with min(horizon, distance). This modification tends to make all truly close compounds look close in the projection. This is the minimal objective for the method: the structures should “look close if they really are close”.
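One way to write the objective just described is sketched here. The symbols $d_{ij}$ (true distance between compounds i and j), $\delta_{ij}$ (distance in the projection), $h$ (horizon) and $\varepsilon$ (the small value guarding the denominator) are notation assumed for this illustration. The expressions are a reading of the prose above and are not a reproduction of the numbered equations referenced in the figure descriptions.

$$E = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\max(d_{ij}, \varepsilon)} \right)^{2}, \qquad E_{h} = \sum_{i<j} \left( \frac{\min(h, d_{ij}) - \min(h, \delta_{ij})}{\max(\min(h, d_{ij}), \varepsilon)} \right)^{2}$$

Under this reading, any pair whose true and projected distances both exceed $h$ contributes zero to $E_{h}$, and setting $h$ at or above the largest possible distance recovers $E$ as long as the projected distances stay within that bound.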

[0061] Thus a true distance of 0.35 and a distance in the 2D projection of 1.52 has a penalty of 0, since both true and projected are replaced with the same value, 0.30, yielding a fractional deviation of 0. However, a true distance of 0.30 with an apparent distance in the visualization of 0.03 has a relatively large fractional deviation of 90%, and the NLM iterations will attempt to correct this after the true small distances are corrected. The usual NLM algorithm would spend its time trying to move the compounds which have true distances larger than 0.30 but apparent distances substantially larger. The principal modification of this method, imposing a horizon on distances of 0.30, does a good job in the short range while allowing large deviations to exist near and beyond the horizon.
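The two worked cases above can be checked with a minimal sketch of a horizon-clamped stress function. The names `pair_penalty`, `stress` and `relax` are hypothetical, and the relaxation shown is a simple accept-if-better coordinate search chosen for brevity; it is not Sammon's update rule and not the appendix code.

```python
import math


def pair_penalty(true_d, proj_d, horizon=0.30, eps=1e-6):
    """Squared fractional deviation after clamping both distances at the horizon."""
    t = min(horizon, true_d)
    p = min(horizon, proj_d)
    return ((t - p) / max(t, eps)) ** 2


def stress(coords, true_dist, horizon=0.30):
    """Sum of pair penalties over all pairs i < j of a 2D layout."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            proj = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            total += pair_penalty(true_dist[i][j], proj, horizon)
    return total


def relax(coords, true_dist, horizon=0.30, sweeps=50, step=0.05):
    """Greedy coordinate search: keep a trial move only if it lowers the stress."""
    coords = [list(p) for p in coords]  # work on a copy
    best = stress(coords, true_dist, horizon)
    for _ in range(sweeps):
        improved = False
        for i in range(len(coords)):
            for axis in (0, 1):
                for delta in (step, -step):
                    coords[i][axis] += delta
                    trial = stress(coords, true_dist, horizon)
                    if trial < best:
                        best, improved = trial, True
                    else:
                        coords[i][axis] -= delta  # undo the move
        if not improved:
            step /= 2.0  # refine the search once no move helps
    return coords, best


print(pair_penalty(0.35, 1.52))  # 0.0: both distances clamp to the horizon
print(pair_penalty(0.30, 0.03))  # about 0.81, the square of a 90% deviation

# Toy case: compounds 0 and 1 are truly close (0.10) but drawn 0.25 apart.
true_dist = [[0.0, 0.10, 0.90], [0.10, 0.0, 0.85], [0.90, 0.85, 0.0]]
start = [[0.0, 0.0], [0.25, 0.0], [2.0, 2.0]]
layout, final = relax(start, true_dist)
print(stress(start, true_dist), final)
```

In this sketch, pairs drawn farther apart than the horizon contribute a constant penalty, so small moves of such points change nothing. Starting from coordinates that already place true neighbors near each other, as the PCA step is intended to do, avoids that plateau.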

[0062] As noted earlier, chemists are today faced with analyzing libraries which may contain millions of compounds. Clearly, graphical display of such a vast number of data points in a meaningful way is impossible. For the purposes of this invention, generally only a few thousand data points at most can usefully be displayed on the screen. However, a representation of the distribution of the compounds in diversity space can be achieved by properly selecting compounds from the data set. The visualization graph of this invention is much like a geographical map. One does not expect a map of North America to show individual homes, or even every small town. As the map is narrowed to look at small regions such as a state or county, more detail is expected to appear in order to match the objectives of the viewer. Beyond 2000-5000 points the data obscure each other too much for productive use. It is not possible on most graphics screens to discern and select more than about 30,000 distinct points with uniform spacing. Since much information is in the holes as well as the points, the number of points suitable for display in any one graph is at most a few thousand. The limiting step for larger datasets is the partitioning of the compounds into one or more levels of clusters. Each level will contain a manageable number of points to graph.

[0063] The OptiSim method (Clark, 1997; Clark and Langton, 1998) is a method developed for the purpose of rapid clustering of large datasets. By varying key parameters, the selections can be made to vary from maximum dissimilarity, which is useful when the extreme edges of diversity space are of special interest, through complete linkage hierarchical clustering, which generates representative subsets. The OptiSim method is applied in the present invention primarily to generate subsets which are representative in the sense of partitioning the entire set of compounds into clusters of roughly equal volumes in the high dimensional space. However, the use of the OptiSim method can be varied according to which question is important at the moment: to see unexpected compounds which can be made from a specific reaction and available reagents, the maximum dissimilarity parameters are best.

[0064] To display a full combinatorial library, which typically consists of one billion similar structures, the library would be clustered on multiple levels with each point representing roughly 1000 structures on each level. The full visualization would then have the library at the top level with 1000 cluster centers, each one representing 1000 subcluster centers packed within the horizon, each containing about 1000 extremely similar compounds. The scientist would be able to see the overall distribution at the top level, could see much more detailed views of a part of the map when desired, and could go to a final level of individual compounds of the billion if appropriate. The zooming operation would be reasonably intuitive. Extension to multiple levels is straightforward and within the ability of a practitioner in the art.
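The arithmetic behind the three-level layout above can be sketched as follows, assuming a uniform fan-out of about 1000 items per displayed point. The function name `levels_needed` is hypothetical.

```python
def levels_needed(n_compounds, fanout=1000):
    """Number of cluster levels so that each displayed point stands for about `fanout` items."""
    levels = 0
    capacity = 1
    while capacity < n_compounds:
        capacity *= fanout
        levels += 1
    return levels


print(levels_needed(10**9))   # one billion structures: 1000 x 1000 x 1000
print(levels_needed(47432))   # a library of 47,432 products
```

For one billion structures this gives three levels of roughly 1000 points each, matching the description above; a library of 47,432 products would need two.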

[0065] Example Application of Method:

[0066] The substructural fingerprints used in this example are binary vectors (bitsets) in which each element is set to 1 or 0 to indicate the presence or absence, respectively, of some substructural element in the corresponding molecular structure. The mapping is one-to-one for the substructure keys distributed by MDL,¹ whereas Daylight² fingerprints are hashed such that particular bits can be set by any of several different, unrelated substructures. UNITY®³ fingerprints are qualitatively intermediate, in that only related substructures (e.g., alkyl fragments) get hashed together.

[0067] Fingerprints were originally developed to speed up 2D searches of chemical databases,⁴ but recent work has made it clear that such fingerprints also work remarkably well for assessing similarities and differences between molecules in a biochemically meaningful way.⁵⁻⁹ Because the bit string operations underlying their manipulation are very fast, fingerprints are particularly appealing as tools for dealing with the large amounts of data produced by the high throughput screening (HTS) and combinatorial chemistry programs currently underway at many pharmaceutical companies. In particular, one would like to present the relationship between sets of fingerprints in such a way that the full power of human pattern recognition can be brought to bear for elucidating structure-activity relationships (SARs).

[0068] Unfortunately, fingerprints do not lend themselves naturally to visualization, in part because of their high dimensionality. Indeed, it seems likely that their high dimensionality is directly related to their good neighborhood behavior, that is, the fact that molecules with very similar fingerprints are very likely to exhibit similar biochemical properties.⁷ There are simply too many ways for large numbers of compounds to be mutually distinct to be conveyed with complete accuracy in any low dimensional display space.

[0069] A second complication lies in the fact that the Euclidean distances to which people are accustomed are not the best way to measure distances in fingerprint space. This is because any particular substructure (e.g., a pyrazole ring) is much more relevant in terms of medicinal chemistry when it is found in one or both of two molecules than when it is absent from both. Hence distances (dissimilarities) between two fingerprints are more meaningfully assessed¹⁰,¹¹ using the Soergel¹² distance d given by:

$$d(a,b) = 1 - T(a,b) = \frac{\|a \cup b\| - \|a \cap b\|}{\|a \cup b\|} \qquad \text{(Equation 1)}$$

[0070] where a and b are the fingerprints of interest, the double bars indicate cardinality, and T(a,b) is the Tanimoto similarity coefficient. Note that this distance measure runs from 0 to 1, and that bits which are set to 0 in both fingerprints do not contribute. Taken together, these considerations serve to reduce the effective dimensionality around each fingerprint, which helps to counteract the “curse of high dimensionality” referred to above.
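Equation 1 can be illustrated with a short sketch in which each fingerprint is represented as the set of indices of its bits that are set to 1. This set representation, the function names, and the handling of two empty fingerprints are assumptions made for the illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints are treated as identical
    return len(a & b) / union


def soergel(a, b):
    """Soergel distance, 1 - T(a, b), as in Equation 1."""
    return 1.0 - tanimoto(a, b)


fp_a = {1, 2, 3}
fp_b = {2, 3, 4}
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 4 set in either
print(soergel(fp_a, fp_b))
```

Because only bits set in at least one fingerprint enter the union, bits that are 0 in both never appear in the calculation, which is the property noted in the text.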

[0071] According to the work of Patterson et al.⁷ and that of others,⁵ two molecules separated by a Soergel distance of 0.15 or less (corresponding to a Tanimoto similarity coefficient of 0.85 or more) are likely to exhibit biological activities within two orders of magnitude of each other, which makes them substantially redundant in terms of HTS. Hence, 0.15 is generally used as an exclusion radius when selecting subsets from a combinatorial library.

[0072] Example Methodology: The Sulfonylpiperidine Urea Library

[0073] Consider, for example, the virtual library defined by the reaction shown in FIG. 3, which could be used as a platform from which to design generic screening sub-libraries. The 4-aminopiperidine scaffold upon which the full library is built is not commercially available, but it is a known compound. A UNITY substructure search of commercially available reagents was run and the candidate reagents obtained were screened in ChemEnlighten¹⁴ for desirable physical properties.

[0074] UNITY 2D searches were restricted to molecules containing no more than ten rotatable bonds, and reagents containing the substructural fragments listed in Table 1 were excluded by using the -notlist option in dbsearch. Note that a moderate level of potentially interfering functionality (e.g., single free hydroxyl groups) was permitted, the assumption being that a modest investment in protection and de-protection chemistry could be accommodated. The primary amine and sulfonyl chloride hitlists obtained were then loaded into ChemEnlighten databases and filtered for the physical property limits listed in Table 2. A total of 308 distinct primary amines passed the filters, as did 154 sulfonyl chlorides, so the full library encompassed 308 × 154 = 47,432 products.

[0075] The filters applied were chosen with an eye towards generating products with generally drug-like properties,¹⁵ and succeeded reasonably well: 91% of the products in the resulting library had a molecular weight less than or equal to 550 (68% less than or equal to 500), and 95% returned a CLogP of 5.0 or less. Most contained one or two aromatic rings (38 and 46%, respectively).

[0076] Additional filters are, of course, involved in creating “real” libraries, but those used here are stringent enough to ensure that the distribution of substructural features in the resulting library is realistic. In addition, they produce a range of products which illustrate the behavior of the visualization methods at hand. The product library is also realistic in that it is flexible enough to explore an interesting range of binding site geometries, but not so flexible that tight binding is likely to be precluded by the entropic cost of “freezing out” rotatable bonds.

[0077] OptiSim Subsets

[0078] It is not necessary to project data points for all 47,432 products from fingerprint space simultaneously to get a good idea of the various structural relationships which exist between the compounds which make up the library. Indeed, it is impossible to fully resolve that many points even in three dimensions, let alone in the two dimensions to which one is restricted on a computer screen or in print. Instead, a subset can be selected in such a way that it is representative of those compounds not shown, and which provides a useful mechanism for “drilling down” to any required level of resolution.

[0079] This can be accomplished by examining a random sample, which is, indeed, quite efficient if the structures are uniformly distributed or if one is looking at more than 10 or 20% of all the compounds in a given data set. Unfortunately, combinatorial libraries are often rather unevenly distributed across the region of fingerprint space spanned by each, in that distances between clusters of related products vary depending on the relative structural complexity of the substituents (alkyl vs phenyl vs azoles) and the nature of their linkage to the combinatorial core, as does the “density” of each cluster. Hence a random sample large enough to cover the space adequately tends to produce at least one area where the point density is too high to be useful for evaluating the co-localization and segregation of, for example, activity classes.

[0080] Subsets obtained by applying the OptiSim methodology¹⁷⁻¹⁹ to a large library are more informative, however, in that they are representative enough to give a good sense of the distribution of structures within a library, yet diverse enough to accurately convey its coverage of the available structural space. Such selection sets are built up by pulling the best representative from a series of candidate subsamples and adding it to the set of compounds already selected. Subsample sizes k of 3 to 5 generally work well, so creating selection sets is very fast. Using OptiSim selection is also convenient in that the library need not be fully enumerated: selection can instead be made directly from a combinatorial definition, e.g., from a combinatorial SLN²⁰ (CSLN in SYBYL Line Notation).

[0081] An initial subset of 300 compounds was drawn from the sulfonylpiperidine urea library by running OptiSim with an exclusion radius (distance below which compounds are considered redundant) of 0.15 and a subsample size k=3. Working from a subset has the side benefit of reducing the effective dimensionality of the problem to a considerable degree, since the underlying level of dimensional complexity is always less than the number of compounds being examined. In this example, that translates to a potential reduction from 988 dimensions (the number of bits in a standard UNITY fingerprint) to 300 or fewer.
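The selection loop described above can be sketched as follows. This is a simplified illustration and not the published OptiSim procedure or the appendix code: it assumes that the "best representative" of a subsample is the candidate farthest from the compounds already selected, it returns unchosen candidates directly to the pool, and the function name `optisim_select`, its parameters, and the toy fingerprints are hypothetical.

```python
import random


def soergel(a, b):
    """Soergel distance between fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def optisim_select(items, distance, n_select, k=3, radius=0.15, seed=0):
    """Simplified OptiSim-style selection; returns indices into `items`."""
    rng = random.Random(seed)
    pool = list(range(len(items)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # random seed compound
    while pool and len(selected) < n_select:
        subsample = []
        while pool and len(subsample) < k:
            cand = pool.pop()
            nearest = min(distance(items[cand], items[s]) for s in selected)
            # Candidates inside the exclusion radius are redundant and are dropped.
            if nearest >= radius:
                subsample.append((nearest, cand))
        if not subsample:
            break
        subsample.sort()
        _, best = subsample.pop()  # candidate farthest from the selected set
        selected.append(best)
        pool.extend(c for _, c in subsample)  # unchosen candidates may be reconsidered
        rng.shuffle(pool)
    return selected


# Toy library: 30 random fingerprints plus an exact duplicate of each (hypothetical data).
rng = random.Random(1)
base = [frozenset(rng.sample(range(32), 8)) for _ in range(30)]
fps = base + base
picked = optisim_select(fps, soergel, n_select=10)
print(picked)
```

Because every candidate is checked against the current selection before it can be chosen, no two selected compounds in this sketch lie within the exclusion radius of each other, so a duplicate of an already selected fingerprint is never picked.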

[0082] Combinatorial Sub-Libraries:

[0083] Briefly described below is an example of how a combinatorial sub-library could be selected for ultimate use with the method of this invention. The method of comparing combinatorial sub-libraries using the two dimensional projections implemented by the method of this invention will be described later in this patent document.

[0084] Combinatorial sub-libraries were generated by applying the OptiSim¹⁷ extension illustrated in FIG. 4. The process is seeded by choosing one product at random, which specifies the first reagent pair A₁B₁. At each step, new reagents are chosen at random from the list of those available, and the products produced from each by reaction with the complementary reagents which have already been specified are examined. That reagent whose products compare most favorably to the sub-library which has been built up so far is added to the selection list for the appropriate reagent. What exactly “most favorable” means is very flexible; it may simply mean most diverse, but can also involve considerations of cost or synthetic compatibility.

[0085] In FIG. 4, the subsample size k is set to 3 for illustrative purposes, and a 3×4 pattern has been specified. Compound A₁B₁ is selected at random to seed the process. Reagent candidates a₂₁, a₂₂ and a₂₃ are then considered by comparing a₂₁B₁, a₂₂B₁ and a₂₃B₁ to A₁B₁. That candidate which produces the best set of products (most diverse, cheapest, best average expected activity, etc.) specifies A₂. In the next step, three candidate reagents B are selected: b₂₁, b₂₂ and b₂₃. Each candidate will now give rise to two products, A₁b₂ᵢ and A₂b₂ᵢ, which get evaluated against A₁B₁ and A₂B₁.

[0086] Selections from the reagent lists alternate until one of the specified block dimensions is reached; the corresponding reagent is then skipped over until the full block is filled out. Once a block is completed, a new seed is chosen by picking k candidate compounds at random and comparing them to the products in the blocks which have already been specified. The process then continues as for the first block until the required number of products have been specified or no valid selections remain.

[0087] Note that no products from reactants selected for earlier blocks are considered in selecting the seed product (e.g., A₄B₅ in FIG. 2) which starts a new block, and that all products in preceding blocks are considered when evaluating candidates for subsequent blocks. In FIG. 4, for example, similarity of a₄₂B₅ to A₂B₃ may militate against the selection of a₄₂ as A₄.

[0088] Three 200-member sub-libraries were created using a combination of customized code in SYBYL²¹ Programming Language (SPL) and commercially available functions from the Legion™ combinatorial builder module of SYBYL. The value of k was set to 5 and block dimensions were set to 1×1 (“cherry picking,” which is identical to ordinary OptiSim selection), 10×5 (“four blocks”) or 20×10 (“single block”) for primary amines and sulfonyl chlorides, respectively.

[0089] Reagent subsamples were chosen at random with uniform probability from among those for which no anticipated product fell within an exclusion radius of 0.10 of any product already specified. Candidate reagents were selected with replacement, and so could be selected for inclusion in several different blocks. In fact, only 32 primary amines are called for in the “four blocks” design, because four contributed to two different blocks and three appeared in three blocks. No sulfonyl chlorides were used more than once, so the design would require a total of 52 reagents versus the 30 used in the single block design.

[0090] Roulette wheel selection weighted by price, supplier, etc. can easily be incorporated into the subsample selection process, as can categorical exclusion criteria such as physical property cutoffs (“druggability”).¹⁵
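As a hedged illustration of how roulette wheel selection could replace uniform sampling when a subsample is drawn, the following Python sketch picks k distinct reagents with probability proportional to a weight. The function name `roulette_subsample` and the weighting scheme (for instance, a weight inversely related to price) are assumptions for the example and are not taken from the Code Appendix.

```python
import random


def roulette_subsample(reagents, weights, k, rng):
    """Draw up to k distinct reagents, each with probability proportional to its weight."""
    pool = list(reagents)
    w = list(weights)
    chosen = []
    while pool and len(chosen) < k:
        if sum(w) <= 0:
            break  # nothing left with a positive chance of selection
        # random.choices spins the weighted "wheel" once.
        i = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return chosen
```

A weight of zero acts as a categorical exclusion, so a reagent failing a property cutoff can simply be given zero weight rather than being removed from the list beforehand.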

[0091] For the libraries described here, candidate reagents were rated simply on the basis of diversity. In particular, the MiniMax criterion was used to select the best candidate at each stage: that reagent was selected for which the maximum Tanimoto similarity to any already-specified product was smallest. Other metrics (e.g., smallest average cosine coefficient) can be used in place of MiniMax Tanimoto, and non-structural criteria can be incorporated into the fitness function if desired.
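The MiniMax criterion just described can be sketched as follows. The sketch assumes that each fingerprint is represented as a Python set of "on" bit positions and that each candidate reagent has been mapped to the fingerprints of the products it would generate; the names `tanimoto` and `minimax_pick` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def minimax_pick(candidate_products, specified_products):
    """Return the reagent whose worst-case (maximum) similarity to any
    already-specified product is smallest.

    candidate_products: dict mapping reagent name -> list of product fingerprints
    specified_products: list of fingerprints already in the sub-library
    """
    def worst_case(name):
        return max(
            tanimoto(p, s)
            for p in candidate_products[name]
            for s in specified_products
        )

    return min(candidate_products, key=worst_case)
```

Substituting a different scoring function for `worst_case` (an average cosine coefficient, or a term reflecting cost) would change the fitness function without altering the surrounding selection loop.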

[0092] A thorough characterization of the library designs obtained using OptiSim in this way is beyond the scope of this patent document, but several salient points bear mentioning:

[0093] Replacement of “bad” reagents which slip past the filters simply entails re-running the corresponding step in the analysis while including products specified at subsequent steps when evaluating replacement candidates; replacing B₄, for example, would involve comparison of its products with A₅B₈, A₁₀B₄, etc. as well as with A₁B₁ and A₃B₂.

[0094] Extension to reactions involving more than two reagents is straightforward.

[0095] Perhaps most interesting is the use of roulette wheel selection in place of uniform random sampling for choosing subsample candidates. Introducing a particular bias (e.g., towards cheaper reagents) when deciding which subsample of reagents to consider next can produce quite different results from those produced by adding analogous terms to the fitness function used to select the “best” candidate from each subsample.

[0096] Note that sublibraries obtained in this way are both representative and diverse, in the same sense that OptiSim selection sets are.^(18,19) For any given block layout, the balance between the two characteristics is set by the value chosen for k: smaller subsample sizes give more representative sublibraries and larger subsample sizes give more diverse ones.

[0097] PCA and NLM Projections

[0098] Principal components analysis (PCA) has seen extensive use in diversity analysis.^(23,24) FIG. 5A shows the projection obtained by extracting the first two principal components from the fingerprint space for the 300-compound OptiSim selection set described above. This subset includes eleven compounds which have no neighbors within a Soergel radius of 0.3, beyond which biochemical similarity falls off rapidly; their positions in the plot are highlighted as open circles. It is not at all obvious by inspection of the principal components projection that these eleven compounds are structurally isolated. In fact, they all tend to fall into the central areas of the map.
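A minimal sketch of how a two-component principal components projection could be computed is given below. It is not the routine used to prepare FIG. 5A: it substitutes a plain power iteration with deflation for whatever eigen-solver was actually employed, it assumes each fingerprint is supplied as a list of numeric values (0.0 or 1.0 per bit), and the function name `pca_2d` is hypothetical.

```python
import math


def pca_2d(rows, iters=200):
    """Project rows onto their first two principal components (power iteration sketch)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(m)]
    data = [[r[c] - means[c] for c in range(m)] for r in rows]  # center each column
    columns = []
    for _component in range(2):
        # Non-uniform start vector, normalized to unit length.
        v = [1.0 + 0.01 * c for c in range(m)]
        norm = math.sqrt(sum(t * t for t in v))
        v = [t / norm for t in v]
        for _step in range(iters):
            # One multiplication by X^T X, done as X^T (X v).
            scores = [sum(a * b for a, b in zip(row, v)) for row in data]
            w = [sum(s * row[c] for s, row in zip(scores, data)) for c in range(m)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm == 0.0:
                break  # no variance left to extract
            v = [t / norm for t in w]
        scores = [sum(a * b for a, b in zip(row, v)) for row in data]
        columns.append(scores)
        # Deflate: subtract the component just found before seeking the next one.
        data = [[x - s * vc for x, vc in zip(row, v)] for row, s in zip(data, scores)]
    return list(zip(columns[0], columns[1]))
```

The pairs returned by such a routine would serve only as starting coordinates; the sign of each axis is arbitrary, which does not affect the pairwise distances in the projection.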

[0099] FIG. 6 includes the corresponding structures, which are numbered in parentheses in the order in which they were brought into the OptiSim selection set; “X” in each chemical structure denotes the shared piperidine core.

[0100] The PCA map can be modified to better reflect the real pairwise distances within the data set by applying a non-linear mapping (NLM) technique developed originally by Sammon²⁵ and subsequently extended by Kowalski and Bender²⁶ and by others.^(27,28,29) In this approach, the PCA coordinates are perturbed so as to minimize some stress function. FIG. 5B shows the result of doing this for the sulfonylpiperidine ureas using Sammon's original stress function S:

$$S = \sum_{i} \sum_{j > i} \frac{\left( d_{ij}^{*} - d_{ij} \right)^{2}}{d_{ij}} \qquad \text{(Equation 2)}$$

[0101] where d_(ij)* is the distance between points i and j in the projection, and d_(ij) is the distance between i and j in the original space. Here, we are interested in the Soergel distance.
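To make Equation 2 concrete, the following sketch evaluates the stress for a set of two dimensional coordinates. It assumes that binary fingerprints are held as sets of "on" bits, for which the Soergel distance is the complement of the Tanimoto coefficient, and that no two compounds have identical fingerprints (a zero d_(ij) would cause a division by zero). The function names are hypothetical.

```python
import math


def soergel(a, b):
    """Soergel distance between binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def sammon_stress(coords, d_orig):
    """Stress S of Equation 2.

    coords: list of (x, y) positions in the projection
    d_orig: square matrix of distances d_ij in the original space
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_star = math.dist(coords[i], coords[j])  # distance in the projection
            d = d_orig[i][j]                          # distance in the original space
            total += (d_star - d) ** 2 / d
    return total
```

A projection that reproduces every original distance exactly has zero stress; any minimizer (steepest descent, for example) can then be applied to the coordinates to reduce S.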

[0102] The isolated points have been displaced towards the edge of the map, which is clearly desirable. This improvement comes, however, at the cost of reducing the anisotropy of the map: the distinctive shape of a PCA projection is characteristically reduced or lost altogether in generating a non-linear map from a high-dimensional space, particularly for data sets as inherently symmetrical as combinatorial libraries.

[0103] Many near neighbors in the fingerprint space are also near neighbors in both projections (not shown), but many have been pulled apart in the PCA or the NLM projection, or in both. Examples include the other ten compounds highlighted in FIGS. 5A and 5B, which have been paired up by similarity; their structures are also shown in FIG. 6. The Soergel distances separating 12 from 20, 10 from 14, 19 from 21, 4 from 8, and 16 from 18 are 0.243, 0.249, 0.271, 0.304 and 0.339, respectively. These separations are small enough to imply a substantial potential for similarity in biological activity but large enough that differences in potency can be expected to exceed 100-fold. Such pairs form the bridges which link structural islands of biological activity, so getting an accurate presentation of their relationship to each other is critically important.

[0104] A Modified NLM

[0105] Unfortunately, the relatively large separations which dominate the NLM in FIG. 5B are precisely those which carry the least amount of useful information; it is the local similarity which matters most. Once the Soergel distance between two fingerprints gets much beyond 0.4, one can conclude that the corresponding structures are different, but not really how different they are.³⁰

[0106] This consideration has been incorporated into the NLM in the method of this invention by modifying the stress function so that each compound only “sees” compounds which lie within a neighborhood of radius h around it. This has been accomplished by replacing each of the distance terms in the numerator of Equation 2 with the distance h to the horizon whenever two compounds are far apart (Equation 3).

$$S = \sum_{i} \sum_{j > i} \frac{\left( \min\left( h, d_{ij}^{*} \right) - \min\left( h, d_{ij} \right) \right)^{2}}{d_{ij}} \qquad \text{(Equation 3)}$$
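The effect of the horizon in Equation 3 can be seen in a short sketch that differs from a direct evaluation of Equation 2 only in clipping both distances at h before they are compared. The function name `horizon_stress` is hypothetical, and the sketch again assumes that no original distance is zero.

```python
import math


def horizon_stress(coords, d_orig, h):
    """Modified stress S of Equation 3 with horizon h."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d_star = math.dist(coords[i], coords[j])  # distance in the projection
            d = d_orig[i][j]                          # distance in the original space
            # Distances beyond the horizon are replaced by h, so a pair that is
            # far apart in both spaces contributes nothing to the stress.
            total += (min(h, d_star) - min(h, d)) ** 2 / d
    return total
```

For example, two compounds separated by 0.9 in fingerprint space and placed 5 units apart in the map contribute zero stress when h=0.3, whereas with h=1.0 the same pair contributes (1.0 - 0.9)²/0.9.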

[0107] Sacrificing long-range interactions in this way allows the NLM to relieve stress by unfolding. This is illustrated in the displays of FIG. 7, which shows NLM plots created by minimizing the modified stress function defined in Equation 3 as h is reduced from 0.65 down to 0.3.

[0108] Compounds which do not fall within the horizon of any other compound in the subset being examined cannot be placed meaningfully into the projection and so are set off to the edge of the plot (shaded circles in FIGS. 7C and 7D). Two compounds (2 and 13) are excluded at h=0.4 (FIG. 7C), but compounds 12 and 20 remain well-separated, as, to a lesser extent, do compounds 4 and 8. Upon contracting the horizon still further to h=0.3, the remaining nine isolated compounds are pushed off the map, whereas all five problem pairs cluster appropriately.
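Compounds lying beyond the horizon of every other compound can be identified directly from the matrix of original distances before any projection is attempted. The following sketch assumes a full symmetric distance matrix and treats a neighbor at exactly the distance h as lying within the horizon; the function name `beyond_horizon` is hypothetical.

```python
def beyond_horizon(d_orig, h):
    """Indices of compounds with no neighbor at a distance of h or less."""
    n = len(d_orig)
    return [
        i
        for i in range(n)
        if all(d_orig[i][j] > h for j in range(n) if j != i)
    ]
```

The indices returned are those which would be set off to the edge of the plot, and the list can only grow as h is contracted.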

[0109] The acid test for any visualization method is its ability to order structures in a way which makes sense to a medicinal chemist. FIG. 8 again shows the projection for the 300-compound OptiSim selection set at h=0.3, but with different compounds highlighted to illustrate the rather “natural” layout of substructures produced by the introduction of an horizon.

[0110] As one might expect from the chemistry involved in production of the respective reagents, benzenesulfonyl chlorides and benzylamines dominate the pools of available reagents. Their mutual prevalence is reflected in the dense clump of diaryl compounds (e.g., 22 and 23) in the upper left quadrant. Those rare compounds such as 33 and 34, which lack aryl groups altogether, co-segregate in the sparsely populated area to the right of center in the map, whereas alkylamino arylsulfonamides 26, 32, 38 and 39 occupy the center and center left.

[0111] Arylamino alkanesulfonamides 35-37 fall into the upper right quadrant, with the more aliphatic 35 positioned towards the bottom of the cluster, near the non-aryl 33 and 34. Thiophenes and azoles (e.g., 27-31) appear in the lower left quadrant. Compound 28 is a particularly distinctive compound and so shows up at the periphery of the plot, near the less unusual 5-isoxazolylthiophene-2-sulfonamide 27. The “reasonableness” of such distributions, which is intuitively appealing to medicinal chemists but which in the past has been difficult or impossible to quantify, now has a firm analytical footing in the visualization method of this invention.

[0112] Comparing Combinatorial Sub-Libraries

[0113] Relationships among two or more libraries are best visualized by projecting them into a common NLM, but using fingerprints from all 600 compounds in the individually selected, four block and single block sub-libraries described above produces an unnecessarily overcrowded map. Instead, 100 compounds were drawn at random from each sub-library. The three samples obtained were then pooled and projected together using h=0.3 to create the map shown in FIG. 9.

[0114] This plot clearly supports the expected conclusion³² that the sub-library of individually selected compounds (cherry picking design) is the most diverse, whereas the single block design is the least diverse and, concomitantly, the most redundant. One indication of this is the eight representatives from the cherry picking library which appear along the edge of the plot, indicating that they fall beyond the horizon of any other compound in the sub-libraries. By contrast, only two such outliers (41 and 54) were produced by the four block design, and only one (53) by the single block sub-library. In addition, the individually selected compounds are clearly more evenly spread in general. Finally, note the redundancy indicated by the large clumps of single-block compounds which surround 42, 46 and 48.

[0115] These points could probably be gleaned from summary statistics calculated “blind” using pairwise distances or other numerical data. However, such analysis would not detect the significant under-sampling of compounds evident in the upper right quadrant circumscribed by 51, 52 and 55, particularly in the single block design (large green symbols). The ability to identify such diversity “holes” by direct inspection is a major advance enabled by the present invention.

[0116] Visual comparisons of such projections also provide a way to assess trade-offs in optimality among factors such as coverage, diversity, synthetic efficiency, cost and redundancy across variations in sublibrary design parameters (e.g., subsample size k in the OptiSim design strategy described here).

[0117] Projecting Biological Activity into Fingerprint Space

[0118] Analyses carried out on literature data sets have clearly shown that 2D fingerprints exhibit good neighborhood behavior.⁷ The visualization method of this invention provides a less abstract demonstration of this point. To accomplish this, we examined the results of assaying a generalized screening library of proprietary kinase inhibitors against a specific target enzyme, then applying the combination of PCA and modified NLM projection to fingerprints for 300 compounds drawn at random from the pool of inactives together with 100 randomly selected actives. The plots obtained are shown in FIGS. 10A-C, with actives indicated by larger symbols and inactives by the smaller symbols. FIGS. 10A and 10B show the PCA and direct (no horizon) NLM projections for this data set, whereas the plot in FIG. 10C was obtained with h=0.3.

[0119] There is much more structural diversity among compounds in the kinase data set than is found in the sulfonylpiperidine library, with 80% of the pairwise distances between the fingerprints from the kinase library in excess of the maximum pairwise separation (0.714) seen in the combinatorial one. The large number of long-range interactions involved reduces the extent of “rounding up” possible in this case when going from the principal components projection (FIG. 10A) to unmodified NLM (FIG. 10B).

[0120] A handful of inactive compounds fall into the cluster of actives which includes compounds 56-60, and 70 and 71 are juxtaposed in both FIG. 10A and FIG. 10B despite the large pairwise separation between them (0.861) in fingerprint space. Applying our modified NLM procedure with h=0.3 (FIG. 10C) removes 61, 62, 71 and other outliers (i.e., compounds with no neighbors within a Soergel distance of 0.3) to the frame of the plot and purges the inactives from the large cluster of actives to the left of the plots. Moreover, the stress drops from 9034 to 36 in going from FIG. 10B to FIG. 10C. Other compounds have been highlighted as light blue squares to illustrate how imposing the horizon affects their distribution relative to one another.

[0121] A greater proportion of inactives (56%) show up as outliers in FIG. 10C than is the case for the actives (30%), indicating that the distribution of “hits” is gratifyingly non-random. Of greater interest, however, are the several islands of activity set off from one another by intervening stretches of inactives: good neighborhood behavior implies that such islands will be relatively free of inactives, though it does not preclude the existence of multiple islands. Nor does it imply that the scale of coupling between activity and structure will be the same everywhere. Indeed, some of the “shorelines” in FIG. 10C are much more sharply defined than others. Cases in which structural changes as simple as adding a methyl group produce a dramatic drop or increase in biological activity represent extreme instances of this, but they in no way disprove the existence of the islands themselves or the continuity of the activity (and lack thereof) on either side of such boundaries.

[0122] Direct examination of the underlying structures shows that each island represents a more or less different chemical family from the large island at the left of the plot, particularly for those farther afield. Some of the compounds which make up the smaller islands are quite active and so may represent new lead areas of chemistry ripe for more thorough exploration.

[0123] The inactive compounds make a key contribution to this plot by defining the “shores” of the islands of activity. Note, in fact, that the activity islands are not completely surrounded by inactives. The unbounded edges of the islands may suggest synthetic directions to take which could extend the scope of the chemistries involved. The exact nature of such direction is very context dependent, and is best determined by identifying the structures near the unbounded edge. Finding activity for compound 26 in FIG. 8 with respect to some (hypothetical) target receptor would suggest synthesis of methoxymethyl- or hydroxyethylcyclohexyl homologs, or of hydroxymethylcyclopentyl or hydroxymethyltetrahydrofuranyl amine analogs, for example. Finding activity for 28, on the other hand, would suggest synthesis of pyridone or furanyl analogs. A quick similarity search carried out against known inactives would then show whether such compounds do indeed represent a real boundary in structural space.

[0124] No summary statistic which could accomplish this as effectively as direct visual inspection of FIG. 10C does is known in the prior art.

[0125] Projecting Pharmacophore Models into Fingerprint Space

[0126] A four-point pharmacophore model for the target enzyme was formulated in connection with the kinase research project. When this pharmacophore hypothesis was employed as a query in a UNITY 3D flex search, it “hit” 67% of the actives and 26% of the inactives, but only 1% of the more generalized database of drug-like molecules represented by Chapman and Hall's Directory of Pharmacological Agents. FIG. 10D shows the plot obtained by applying the modified NLM procedure (h=0.3) of this invention to an initial PCA for all actives which matched the proposed pharmacophore together with “hits” from the same number of inactives selected at random.

[0127] The actives in FIG. 10D are distributed in a very similar pattern to those in FIG. 10C, indicating that the query captures something quite real about available binding sites on the target enzyme. The similarity between the two maps testifies to the value of using PCA to get consistent starting coordinates and to how robust the unfolding by the modified NLM is. Moreover, the general disorganization of the inactives away from the islands of activity indicates that such “hits” are probably non-specific, in that the structural classes to which they belong characteristically present the pharmacophore of interest.

[0128] Two compounds (61 and 71) which are outliers in FIG. 10C show up in doubleton “islands” in FIG. 10D. This is because all compounds “hit” by the query were used to generate the latter map, whereas only one of each pair happened to get selected for the random sample used to generate the former. The two pairs fall well off to the right in FIG. 10D, reflecting their isolation from other “hits” in structural (fingerprint) space.

[0129] The roots of the inadequacy of both PCA and standard NLM for projecting combinatorial libraries from fingerprint space down into two dimensions become clearer when one considers some details of how compounds in such libraries are typically distributed in structural space; doing so also illuminates the reason that introducing an horizon is so effective.

[0130] To begin with, the useful dynamic range of the Soergel distances within a combinatorial library is limited if there is any scaffold to speak of. The smallest distance between any two of the 300 compounds shown in FIGS. 5-8, for example, is 0.163, whereas the largest distance is only 0.714. This is less than a five-fold range, yet it spans the spectrum from near redundancy in an HTS context to essentially no expected relationship in biochemical activity.

[0131] In addition, the high dimensionality of fingerprints makes it easy to generate nearly symmetrical relationships which cannot be displayed accurately in two dimensions. All 21 pairwise Soergel distances between compounds 1, 2, 3, 5, 6, 9 and 11 (FIG. 6), for example, fall between 0.424 and 0.527. In other words, they form a slightly irregular six dimensional simplex. Even a tetrahedron, which is only a three dimensional simplex, cannot be projected into two dimensions without severe distortion. Absent interactions with other points, a perfectly regular six dimensional simplex will be projected as a regular heptagon; hence the tendency towards round, isotropic maps when “ordinary” NLM is applied in this situation.
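The distortion inherent in such a projection can be checked numerically. In a regular six dimensional simplex all 21 pairwise distances are equal, whereas the vertices of a regular heptagon are separated by three different chord lengths. The short sketch below, in which the function name `heptagon_distances` is hypothetical, lists the pairwise distances for a heptagon of a given circumradius.

```python
import math


def heptagon_distances(radius=1.0):
    """All 21 pairwise distances between the vertices of a regular heptagon."""
    pts = [
        (radius * math.cos(2 * math.pi * k / 7), radius * math.sin(2 * math.pi * k / 7))
        for k in range(7)
    ]
    return [math.dist(pts[i], pts[j]) for i in range(7) for j in range(i + 1, 7)]
```

The 21 distances fall into three groups of seven, with lengths 2r·sin(π/7), 2r·sin(2π/7) and 2r·sin(3π/7), so the longest is roughly 2.25 times the shortest even though the corresponding distances in the original space are identical.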

[0132] That long-range, high-dimensional relationships do exist within these data sets is clear from the principal component analyses used to derive starting points for the NLM. The first and second principal components obtained for the sulfonylpiperidine library (FIG. 5A) capture only 5.8 and 4.9%, respectively, of the total variance in the corresponding fingerprints, for example; extending the projection up to ten components (dimensions) only captures 28.7% more, for a total of 43.6%. Indeed, it would take a reduced descriptor space of 62 dimensions to capture 85% of the variance for this data set. PCA statistics from the more diverse kinase data set (FIG. 10A) are even more daunting: the first two components capture 14.5% of the total variance, the first ten components capture 34%, and 108 components are required to account for 85% of the original fingerprint variance.
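The number of components needed to reach a given fraction of the variance follows directly from the eigenvalues of the covariance matrix. A minimal sketch is shown below; it assumes the eigenvalues have already been obtained from a PCA, and the function name `components_needed` is hypothetical.

```python
def components_needed(eigenvalues, target):
    """Smallest number of leading components whose share of variance reaches target."""
    total = sum(eigenvalues)
    running = 0.0
    # Components are taken in order of decreasing variance explained.
    for count, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= target:
            return count
    return len(eigenvalues)
```

Applied with `target=0.85` to the eigenvalue spectrum of a fingerprint data set, such a function would return the kind of figure quoted above for the reduced descriptor space.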

[0133] The modified NLM procedure could, of course, be initiated using random starting coordinates, which would in many cases produce projections with comparably low stress. The key reason to use principal components is not their explanatory power but the continuity they bring to projections obtained from overlapping subsets: random initialization would obliterate the commonalities of pattern between FIGS. 10C and 10D, for example.

[0134] Cutting off long range effects in these projections by introducing an horizon allows the maps to relax, essentially by letting them unfold. For the modified NLM maps for the 300-compound subset, for example, the total stress S falls sharply as the horizon shrinks, from 5151 for h=1.0 to 4747, 2403, 1151 and 253 for h=0.65, 0.50, 0.40 and 0.30, respectively. This reduction comes in part from defining away long-range stress, but it can also be interpreted as eliminating distracting sources of long-range noise which are irreconcilable anyway. Less information is actually discarded than one might expect: the 9292 pairwise distances which fall within the horizon of 0.4 used to create FIG. 7C imply that, on average, each compound “sees” about 136 neighbors; 233 neighbors, on average, fall within 0.5 of each compound, and 57 fall within an horizon of 0.3.

[0135] Fifty-seven compounds can still support some relatively high dimensional relationships, however. It is evident from the data presented here that fingerprint spaces defined by chemical libraries in general, and by combinatorial libraries in particular, are locally “flat” networks embedded at all angles in a mostly empty space, somewhat like the snowflakes making up a snowdrift. That they can be unfolded while preserving local detail and connectivity seems reasonable, given the constraints that chemical connectivity and feasibility of synthesis put on incremental structural changes and the vast diversity which is synthetically accessible. The result is that the local dimensionality around any single compound is usually much lower than is that of the library as a whole.

[0136] Setting an NLM horizon at or near a Soergel distance of 0.3 defines neighborhoods within which the effective dimensionality is low enough that meaningful projection into two dimensions is possible. It is fortunate that this natural scale of unfolding conserves relationships between individual structures and between structural classes, while also making possible informative projections of biological activity into the unfolded structural space which results. This will certainly not be the case for all high-dimensional descriptor spaces; where it does hold true, however, the method described in this patent document may prove more generally useful.

[0137] General Considerations of Visualization Methodology:

[0138] The description of the invention thus far has utilized fingerprints as an example of a high-dimensional molecular descriptor which can be visualized in two dimensions. Other descriptors are, of course, well known and can be employed with the method of this invention. Four additional high dimensional descriptors can also be used to illustrate the method of adding new descriptors in a general way. Molecular holograms are simply fingerprints extended to track the number of occurrences of each fragment to replace the binary presence/absence bit. Holograms have proven to be valuable in predicting activity (Tripos, 1997). Atom pairs (Sheridan et al., 1994) are vectors which describe the number of bonds between all important molecular features. Pharmacophoric triplets (Pickett et al., 1996) intuitively relate to the medicinal chemist's view of how a compound docks at a receptor site, and at least for exploring within chemical series they appear to be useful for optimizing compound affinity. The MolConn molecular connectivity descriptors (Kier and Hall) have a long history of use in small series and can now be tested on larger ones.

[0139] However, at the present time not all high-dimensional descriptors may be utilized with the combination of PCA and modified NLM. Shape descriptors are particularly difficult, because the alignment and conformational adjustments involved in finding the best match between two molecules mean that a molecule does not have a single shape. The distances among three molecules need not obey the triangle inequality (the distance from A to B can be larger than the sum of the distance from A to C plus the distance from C to B). Similar behavior occurs in protein homology scoring, where the best sequence alignment for any one protein depends on the other protein. In effect, these unusual descriptors call for each new molecule to appear at more than one place in the visualization map, since it is seen differently by each molecule to which it is compared. Clearly, however, the method of this invention works well with molecular descriptors which associate with each molecule a fixed vector of numbers.
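Whether a candidate dissimilarity measure obeys the triangle inequality on a given data set can be tested before a projection is attempted. The sketch below assumes a full symmetric dissimilarity matrix and reports every triple (A, B, C) for which the distance from A to B exceeds the sum of the distances from A to C and from C to B; the function name `triangle_violations` and the tolerance `tol` are assumptions made for the example.

```python
from itertools import permutations


def triangle_violations(d, tol=1e-12):
    """Triples (a, b, c) with d[a][b] > d[a][c] + d[c][b], reported once per pair a < b."""
    n = len(d)
    return [
        (a, b, c)
        for a, b, c in permutations(range(n), 3)
        if a < b and d[a][b] > d[a][c] + d[c][b] + tol
    ]
```

An empty result on a representative sample does not prove that a descriptor is metric in general, but a non-empty result indicates the kind of behavior described above for shape descriptors.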

[0140] The software code to perform the visualization of this invention is contained in the Code Appendix. The points which form the projected map determined by the program may be displayed in Excel or any other program, custom or commercially sold, which can display scatter plots. As noted earlier, additional display code, which does not form a part of the present invention, can be implemented in JAVA or some other language by those skilled in the art to aid in exploring the two dimensional plots and to provide access to the molecular structure which corresponds to each point in the display. Such code was used to provide FIGS. 2, 8, and 9.

References

[0141]¹ MDL Information Systems, Inc., 14600 Catalina Street, San Leandro, Calif. 94577.

[0142]² Daylight Chemical Information Systems, Inc., 27401 Los Altos, Mission Viejo, Calif. 92691.

[0143]³ UNITY is distributed by Tripos, Inc., 1699 S. Hanley Rd., St. Louis, Mo. 63144.

[0144]⁴ Willett, P. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996.

[0145]⁵ Brown, R. D., and Martin, Y. C. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584.

[0146]⁶ Matter, H., and Lassen, D. Compound libraries for lead discovery. Chemica Oggi 1996, 6, 9-15.

[0147]⁷ Patterson, D. E., Cramer, R. D., Ferguson, A. M., Clark, R. D., and Weinberger, L. E. Neighborhood behavior: a useful concept for validation of molecular diversity descriptors. J. Med. Chem. 1996, 39, 3049-3059.

[0148]⁸ Matter, H. Selecting optimally diverse compounds from structure databases: a validation study of two-dimensional and three-dimensional molecular descriptors. J. Med. Chem. 1997, 40, 1219-1229.

[0149]⁹ Wild, D. J., and Blankley, C. J. Comparison of 2D fingerprint types and hierarchy level selection methods for structural grouping using Ward's clustering. J. Chem. Inf. Comput. Sci., in press.

[0150]¹⁰ Willett, P., and Winterman, V. A comparison of some measures for the determination of inter-molecular structural similarity. Quant. Struct.-Act. Relat. 1986, 5, 18-25.

[0151]¹¹ Barnard, J. M., and Downs, G. M. Clustering of chemical structures on the basis of two-dimensional similarity measures. J. Chem. Inf. Comput. Sci. 1992, 32, 644-649.

[0152]¹² Gower, J. C. Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences; Kotz, S., and Johnson, N. L., Eds., John Wiley & Sons, New York, 1985, Vol. 5, pp. 397-405.

[0153]¹³ Available Chemicals Directory is distributed by MDL Information Systems, Inc., 14600 Catalina Street, San Leandro, Calif. 94577.

[0154]¹⁴ ChemEnlighten is a registered trademark of Tripos, Inc., St. Louis, Mo. 63144.

[0155]¹⁵ Lipinski, C. A., Lombardo, F., Dominy, B. W., and Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 1997, 23, 3-25.

[0156]¹⁶ CLogP is a product of BioByte, Inc., Pomona Corporation

[0157]¹⁷ Patent pending. OptiSim is a registered trademark of Tripos, Inc., 1699 S. Hanley Rd., St. Louis, Mo. 63144.

[0158]¹⁸ Clark, R. D. OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. J. Chem. Inf. Comput. Sci. 1997, 37, 1181-1188.

[0159]¹⁹ Clark, R. D., and Langton, W. J. Balancing representativeness against diversity using optimizable K-dissimilarity and hierarchical clustering. J. Chem. Inf. Comput. Sci. 1998, 38, 1079-1086.

[0160]²⁰ Ash, S., Cline, M. A., Homer, R. W., Hurst, T., and Smith, G. B. SYBYL line notation (SLN): a versatile language for chemical structure representation. J. Chem. Inf. Comput. Sci. 1997, 37, 71-79.

[0161]²¹ SYBYL is distributed by Tripos, Inc., St. Louis, Mo. 63144.

[0162]²² Judson, R. Genetic algorithms and their use in chemistry. In:Reviews in computational chemistry, Lipkowitz, K. B., and Boyd, D. B.,Eds., VCH Publishers, New York, 1997, Vol. 10, pp. 1-73.

[0163]²³ Martin, E. J., Blaney, J. M., Siani, M. A., Spellmeyer, D. C.,Wong, A. K., and Moos, W. H. Measuring diversity: experimental design ofcombinatorial libraries for drug discovery. J. Med. Chem. 1995, 38,1431-1436.

[0164]²⁴ Shemetulskis, N. E., Dunbar, J. B., Jr., Dunbar, B. W., Moreland, D. W., and Humblet, C. Enhancing the diversity of a corporate database using chemical database clustering and analysis. J. Comput.-Aided Mol. Design 1995, 9, 407-416.

[0165]²⁵ Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 1969, C-18, 401-409.

[0166]²⁶ Kowalski, B. R., and Bender, C. F. J. Am. Chem. Soc. 1973, 95, 686-692.

[0167]²⁷ Domine, D., Devillers, J., Chastrette, M., and Karcher, W. Non-linear mapping for structure-activity and structure-property modelling. J. Chemometrics 1993, 7, 227-242.

[0168]²⁸ Hudson, B., Livingstone, D. J., and Rahr, E. Pattern recognition display methods for the analysis of computed molecular properties. J. Comput.-Aided Mol. Design 1989, 3, 55-65.

[0169]²⁹ Agrafiotis, D. K. Stochastic algorithms for maximizing molecular diversity. J. Chem. Inf. Comput. Sci. 1997, 37, 841-851.

[0170]³⁰ Flower, D. R. On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379-386.

[0171]³¹ Patent pending.

[0172]³² Gillet, V. J., Willett, P., and Bradshaw, J. The effectiveness of reactant pools for generating structurally-diverse combinatorial libraries. J. Chem. Inf. Comput. Sci. 1997, 37, 731-741.

[0173]³³ Delaney, J. S. Assessing the ability of chemical similarity measures to discriminate between active and inactive compounds. Mol. Diversity 1995, 1, 217-222.

TABLE 1 Substructure exclusions included in the files specified by the -notlist option in 2D UNITY searches

UNITY Query   SLN for excluded substructures    Targets
CH2N[f]H2     CHN[not═NHC(═O)].C[not═C:Any]NH   polyamines
CS(═O)(═O)Cl  CHN[not═NHC(═O)]                  free amines
              S(═O)(═O)Hal.S(═O)(═O)Hal         polysulfonyl halides
Both          C(═O)OH                           free acids
              C(═O)O[f]                         carboxylate salts
              C(═Het)Hal                        reactive halides
              OH.OH                             polyols
              C(═Het)NH.C(═Het)NH
              N[not═NHC(═O)]HN[not═NHC(═O)]H    hydrazines
              C(═Het)N.C(═Het)N.C(═Het)N        peptides
              C[is═C-Any═:Any]HZ{Z:Cl,Br,I}     activated halides
              N(˜O[f])˜O[f]                     nitro compounds
              F.F.F.F.F.F.F                     perfluoroalkyls > C2
              CCCCCCCCH3                        long alkyls
              H[I═2]                            heavy isotopes
              H[I═3]                            ″
              C[I═13]                           ″
              C[I═14]                           ″
              N[I═15]                           ″
              S[I═35]                           ″
              P[I═32]                           ″

[0174] TABLE 2 Statistics and secondary filters applied to primary reagent lists.

                        Primary Amines    Sulfonyl Chlorides
Property                Cutoff  Passed    Cutoff  Passed
Single structure        —       436       —       178
Molecular weight        200     361       350     163
Molecular volume (Å³)   190     363       255     165
ClogP                   2.6     370       5.0     168
Aromatic ring count     1       394       2       171
Combined filters        —       308       —       154

What is claimed is:
 1. A method of visualizing in two or three dimensions the relationships in diversity space of compounds which are characterized by high-dimensional descriptors comprising the steps of:
a. selecting a representative subset of the compounds;
b. calculating or retrieving from a data base the descriptor characteristics for each compound of the subset selected;
c. generating a distance matrix between all compounds or utilizing a function to generate the distance matrix elements as needed;
d. computing a hierarchy of clusters, defining cluster centers and partitioning each set of compounds at each level of clustering;
e. performing a PCA projection onto the first two components;
f. using a modified stress function which reflects a horizon, running an NLM refinement from the initial PCA coordinates; and
g. graphically displaying the coordinates of each compound as determined by the NLM refinement.
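For illustration only, steps c, e and f of the method recited above can be sketched in Python as follows. This is not the NLMJER.C listing of the appendix. The exact form of the modified stress function is not reproduced in this part of the disclosure, so the sketch assumes one plausible form: target distances are capped at the horizon, and a pair whose descriptor-space distance exceeds the horizon contributes stress only while it plots closer than the horizon. The function names (`distance_matrix`, `pca_2d`, `horizon_stress`, `nlm_refine`), the Euclidean metric, the power-iteration PCA and the step-halving gradient descent are all assumptions of the sketch. Subset selection and hierarchical clustering (steps a and d) are omitted.

```python
import math
import random


def distance_matrix(X):
    """Step c: Euclidean distance between every pair of descriptor rows."""
    return [[math.dist(a, b) for b in X] for a in X]


def pca_2d(X, iters=200):
    """Step e: project rows of X onto the first two principal components,
    found by power iteration with deflation on the covariance matrix."""
    n, d = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(d)]
    Z = [[row[k] - mean[k] for k in range(d)] for row in X]
    cov = [[sum(z[a] * z[b] for z in Z) / n for b in range(d)] for a in range(d)]
    comps = []
    for c in range(2):
        v = [1.0 / (k + 1 + c) for k in range(d)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            for u in comps:  # remove the part along components already found
                dot = sum(w[k] * u[k] for k in range(d))
                w = [w[k] - dot * u[k] for k in range(d)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        comps.append(v)
    return [[sum(z[k] * u[k] for k in range(d)) for u in comps] for z in Z]


def horizon_stress(D, Y, horizon):
    """ASSUMED horizon-modified Sammon stress (see the lead-in paragraph)."""
    total, norm = 0.0, 0.0
    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            target = min(D[i][j], horizon)
            if target == 0.0:
                continue
            d = math.dist(Y[i], Y[j])
            norm += target
            if D[i][j] > horizon and d >= horizon:
                continue  # both separations lie beyond the horizon: no penalty
            total += (target - d) ** 2 / target
    return total / norm if norm else 0.0


def nlm_refine(D, Y0, horizon, steps=200, lr=0.1):
    """Step f: gradient descent from the PCA coordinates; a step is kept only
    if it does not increase the stress, otherwise the step size is halved."""
    Y = [list(p) for p in Y0]
    n = len(Y)
    best = horizon_stress(D, Y, horizon)
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                target = min(D[i][j], horizon)
                d = math.dist(Y[i], Y[j])
                if target == 0.0 or d == 0.0:
                    continue
                if D[i][j] > horizon and d >= horizon:
                    continue
                g = -2.0 * (target - d) / (target * d)
                for k in range(2):
                    delta = g * (Y[i][k] - Y[j][k])
                    grad[i][k] += delta
                    grad[j][k] -= delta
        trial = [[Y[i][k] - lr * grad[i][k] for k in range(2)] for i in range(n)]
        s = horizon_stress(D, trial, horizon)
        if s <= best:
            Y, best = trial, s
        else:
            lr *= 0.5
    return Y


if __name__ == "__main__":
    random.seed(0)
    X = [[random.random() for _ in range(6)] for _ in range(12)]  # toy descriptors
    D = distance_matrix(X)
    Y0 = pca_2d(X)
    Y = nlm_refine(D, Y0, horizon=0.8)
    print(horizon_stress(D, Y0, 0.8), horizon_stress(D, Y, 0.8))
```

The coordinates returned by `nlm_refine` are what step g would plot. Because a trial step is accepted only when the stress does not rise, the refined layout never scores worse than the PCA starting layout under the assumed stress.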